CFVisual: an interactive desktop platform for drawing gene structure and protein architecture

Background
When researchers perform gene family analysis, they often analyze structural characteristics of the genes, such as the distribution of introns and exons. Structural analysis of the amino acid sequences, for example of motif and domain features, is equally essential. Researchers often integrate these analyses into one image to extract more information, but tools for this integration are lacking.
Results
Here, we developed a tool (CFVisual) for drawing gene structure and protein architecture. CFVisual can draw the phylogenetic tree, gene structure, and protein architecture in one picture, and offers rich interactive capabilities that meet the working needs of researchers. It also supports arbitrary stitching of these images. It has become a useful helper in gene family analysis. The CFVisual package was implemented in Python and is freely available from https://github.com/ChenHuilong1223/CFVisual/.
Conclusion
CFVisual has already been used by researchers and cited in published articles. In the future, CFVisual will continue to serve as a helper for researchers studying gene structure and protein architecture.

1 School of Life Science, North China University of Science and Technology, Tangshan 063210, Hebei, China. Full list of author information is available at the end of the article.

specific numerical information cannot be provided, and the website is often inaccessible, etc.
Motifs and domains are the functional units and characteristic structures of amino acid sequences, and are often identified by tools such as MEME and Pfam/NCBI-CDD/SMART [2][3][4][5]. Displaying these motifs and domains along a line helps researchers understand the structure of a protein sequence, and comparison with other protein sequences helps to identify conserved regions and sites of difference. Moreover, combining them with a phylogenetic tree helps in studying the evolution of motifs and domains. When conducting gene family analysis, researchers often need to merge the gene structure map and the motif and/or domain distribution map into a single figure so as to obtain more information. At present, this requires stitching the images together in Adobe Illustrator, Adobe Photoshop, or other image editing software, which in our experience is time-consuming and tedious. Therefore, it is important to develop a tool that removes this manual step.

Methods
We wrote the software logic in Python, implemented the graphical user interface (GUI) with the PySide2 library, and visualized the data with the matplotlib library using our own drawing logic. Finally, we used the PyInstaller library to package the CFVisual platform.
To demonstrate the advantages of CFVisual, we downloaded the latest rice genome data from the rice database (http://rice.uga.edu/) [6], including the whole-genome protein sequences and the GFF3 annotation file. We then used HMMER (threshold set to 1e-10) with the pectinesterase domain hidden Markov model (PF01095.19) to identify candidate rice PME protein sequences [7]. Finally, all candidate protein sequences were checked against the Pfam (https://pfam.xfam.org/), NCBI-CDD (https://www.ncbi.nlm.nih.gov/cdd), and SMART (http://smart.embl-heidelberg.de/) databases, and only protein sequences that contain the pectinesterase domain were considered members of the PME gene family.
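The threshold filter described above can be sketched in Python. This is a minimal illustration, not the pipeline used in this study. It assumes that the threshold is an E-value cutoff and that the HMMER search was run with the --tblout option, whose whitespace-separated table gives the target name in the first column and the full-sequence E-value in the fifth. The function name filter_tblout and the sample rows are hypothetical.

```python
def filter_tblout(text, max_evalue=1e-10):
    """Return target names whose full-sequence E-value is <= max_evalue.

    Assumes HMMER --tblout layout: column 1 is the target name and
    column 5 is the full-sequence E-value. Comment lines start with '#'.
    """
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        target, evalue = fields[0], float(fields[4])
        if evalue <= max_evalue and target not in hits:
            hits.append(target)
    return hits


# Hypothetical rows: identifiers and scores are invented for illustration.
SAMPLE = """\
# target name  accession  query name  accession  E-value  score  bias
geneA.1  -  Pectinesterase  PF01095.19  1.2e-85  285.1  0.0
geneB.1  -  Pectinesterase  PF01095.19  3.0e-07  30.2  0.1
geneC.1  -  Pectinesterase  PF01095.19  4.1e-12  45.0  0.0
"""

candidates = filter_tblout(SAMPLE)
print(candidates)
```

With the cutoff of 1e-10, the toy hit geneB.1 is discarded; the remaining candidates would still need the domain confirmation step described above.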
After that, we wrote a Python script (https://github.com/ChenHuilong1223/CFVisual/) to extract the amino acid sequences and GFF3 annotation information of the rice PMEs. The amino acid sequences of the rice PMEs were analyzed with MEGA X [8], MEME (https://meme-suite.org/meme/), Pfam, NCBI-CDD, and SMART to generate the result files. Finally, these results were visualized using CFVisual.
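An extraction step of this kind can be sketched as follows. This is an independent illustration and not the script from the repository: the function names, the toy identifiers, and the assumption that family members are selected by matching the GFF3 ID or Parent attribute are all ours.

```python
def read_fasta(text):
    """Parse FASTA text into {sequence_id: sequence}."""
    records, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]  # ID is the first word of the header
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def parse_attributes(column9):
    """Turn a GFF3 attribute column (key=value;key=value) into a dict."""
    pairs = (item.split("=", 1) for item in column9.split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def select_gff3(text, wanted_ids):
    """Keep GFF3 rows whose ID or Parent attribute is in wanted_ids."""
    kept = []
    for line in text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue  # skip malformed rows
        attrs = parse_attributes(cols[8])
        parents = attrs.get("Parent", "").split(",")
        if attrs.get("ID") in wanted_ids or any(p in wanted_ids for p in parents):
            kept.append(line)
    return kept


# Toy inputs (hypothetical identifiers and coordinates).
FASTA = ">tx1 toy protein\nMKT\nLLV\n>tx2 toy protein\nMAA\n"
GFF3 = "\n".join([
    "##gff-version 3",
    "\t".join(["chr1", "toy", "mRNA", "1", "900", ".", "+", ".", "ID=tx1;Parent=g1"]),
    "\t".join(["chr1", "toy", "exon", "1", "300", ".", "+", ".", "ID=tx1:exon_1;Parent=tx1"]),
    "\t".join(["chr1", "toy", "mRNA", "2000", "2500", ".", "-", ".", "ID=tx2;Parent=g2"]),
])

wanted = {"tx1"}
sequences = {k: v for k, v in read_fasta(FASTA).items() if k in wanted}
rows = select_gff3(GFF3, wanted)
print(sequences)
print(len(rows))
```

Matching on both ID and Parent keeps a transcript together with its child features (exons, CDSs, UTRs), which is what a gene structure drawing needs.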

Function overview, usage, and illustrative examples
Functionally, CFVisual can be divided into three parts: the gene structure level, the protein architecture level, and the classification and coloring of the phylogenetic tree.

Gene structure
Users can provide GFF3, GTF, or BED files and then use CFVisual to draw the picture. In the interface shown in Fig. 1b, users can set the style of each feature, such as its color, shape, and thickness. Clicking the "Statistics" button makes CFVisual automatically report the gene length, the numbers of introns, UTRs, and CDSs, and other quantitative information (Fig. 1c). Users can also add other information, such as domains and signal peptides (Fig. 1a). Drawing these as combined rectangular boxes helps researchers judge intuitively which CDS fragments encode a domain and whether introns are present.
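The kind of quantitative summary described above can be sketched in a few lines of Python. This is only an illustration of the idea, not CFVisual's internal code: the function name summarize, the tuple layout of the input, and the rule that introns are the gaps between consecutive sorted exons are assumptions made for the example.

```python
from collections import defaultdict


def summarize(features):
    """Summarize transcripts from (transcript_id, type, start, end) tuples.

    Coordinates are 1-based and inclusive, as in GFF3. Introns are
    inferred as the gaps between consecutive exons after sorting.
    """
    exons = defaultdict(list)
    cds = defaultdict(int)
    for tid, ftype, start, end in features:
        if ftype == "exon":
            exons[tid].append((start, end))
        elif ftype == "CDS":
            cds[tid] += 1

    stats = {}
    for tid, spans in exons.items():
        spans.sort()
        introns = [
            (a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(spans, spans[1:])
            if b_start - a_end > 1
        ]
        stats[tid] = {
            "length": spans[-1][1] - spans[0][0] + 1,  # span from first to last exon
            "exons": len(spans),
            "introns": len(introns),
            "cds": cds[tid],
        }
    return stats


# Toy features (hypothetical transcripts and coordinates).
FEATURES = [
    ("tx1", "exon", 1, 300), ("tx1", "exon", 501, 700), ("tx1", "exon", 801, 900),
    ("tx1", "CDS", 50, 300), ("tx1", "CDS", 501, 700), ("tx1", "CDS", 801, 850),
    ("tx2", "exon", 2000, 2500),
]

summary = summarize(FEATURES)
print(summary)
```

Inferring introns from exon gaps avoids relying on explicit intron records, which many annotation files do not contain.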
For the promoter map (Fig. 1d), users provide the location results from PlantCARE [9] or other tools that predict the positions of cis-acting elements. CFVisual reads all cis-acting elements at once, and users can then selectively display those they need.

Protein architecture
The input for drawing the motif diagram (Fig. 1a) is the result file produced by the MEME tool. Compared with conventional motif visualization tools, CFVisual has the following advantages. First, the software fully reproduces the MEME results, including the convention that the height of the rectangular box representing a motif is negatively correlated with its p value: the lower the box, the higher the p value and the lower the credibility of the predicted motif. Second, the "Scanned Sites" results can be displayed as transparent rectangular boxes. Finally, users can selectively display the motifs they want to study.
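The negative relationship between box height and p value can be illustrated with a small Python sketch. The exact mapping used by CFVisual or MEME is not stated here, so the scaling below is purely an assumption: heights are made proportional to -log10(p) and rescaled between a hypothetical minimum and maximum height.

```python
import math


def motif_heights(p_values, max_height=1.0, min_height=0.2):
    """Map motif-site p values to box heights (smaller p gives a taller box).

    Assumed scaling: -log10(p), linearly rescaled to [min_height, max_height].
    All p values must be greater than zero.
    """
    if any(p <= 0 for p in p_values):
        raise ValueError("p values must be positive")
    scores = [-math.log10(p) for p in p_values]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All sites equally significant: draw them at full height.
        return [max_height] * len(scores)
    scale = (max_height - min_height) / (hi - lo)
    return [min_height + (s - lo) * scale for s in scores]


# Hypothetical p values, ordered from most to least significant.
heights = motif_heights([1e-20, 1e-10, 1e-5])
print(heights)
```

A logarithmic scale is a natural choice because motif p values typically span many orders of magnitude; a linear scale would make all but the weakest sites look identical.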
The input for the domain map is the result file of NCBI-CDD, Pfam, or SMART. Here too, users can selectively display the domains they want to study. Another advantage of CFVisual is that a domain can be superimposed on the motif diagram as a rectangular box (Fig. 1a), so that researchers can judge intuitively how the positions of motifs and domains relate to each other.

Classification and coloring of phylogenetic tree
When studying gene structure and protein architecture, researchers often combine them with a phylogenetic tree to study how the structures evolved. CFVisual supports this need well. Users only need to provide a tree file in Newick format, which CFVisual recognizes and draws (Fig. 1a). After that, researchers can use the "Tree Edit Tab" to classify and color the phylogenetic tree, and finally produce high-resolution bitmaps and/or editable vector graphics of publication quality.
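To make the tree input concrete, the sketch below reads the leaf names of a Newick string in their written order and assigns each leaf a color by group. It is a deliberately simplified illustration, not CFVisual's parser: it assumes unquoted leaf labels, and the tree, group assignment, and palette are hypothetical.

```python
import re

# A leaf name directly follows "(" or ","; labels after ")" are internal
# node labels or support values and are therefore not matched.
LEAF = re.compile(r"[(,]\s*([^\s():,;]+)")


def leaf_names(newick):
    """Return leaf labels in the order they appear in the Newick string."""
    return LEAF.findall(newick)


def color_by_group(leaves, group_of, palette):
    """Map each leaf to the color of its group."""
    return {leaf: palette[group_of[leaf]] for leaf in leaves}


# Hypothetical tree, groups, and colors.
TREE = "((A:0.1,B:0.2)90:0.3,(C:0.4,D:0.5)85:0.1);"
GROUPS = {"A": "Group 1", "B": "Group 1", "C": "Group 2", "D": "Group 2"}
PALETTE = {"Group 1": "tab:blue", "Group 2": "tab:orange"}

leaves = leaf_names(TREE)
colors = color_by_group(leaves, GROUPS, PALETTE)
print(leaves)
print(colors)
```

The leaf order matters because the gene structure and protein architecture rows must be drawn in the same order as the tips of the tree.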

Illustrative examples
To better reflect the above advantages of CFVisual, we take the gene structure, motif, and domain drawing results of the PME gene family of rice as an example.
The gene structure of the rice PMEs is shown in Fig. 2, and the numbers of structural elements are listed in Table 1. We observed that the average length of the rice PME genes is 2802.62 bp, the longest is 8802 bp (LOC_Os01g21034.1), and the shortest is 557 bp (LOC_...).
According to the number of introns, eukaryotic genes can be divided into three categories: intronless (no introns), intron-poor (three or fewer introns per gene), and intron-rich (more than three introns per gene) [10]. Combined with the phylogenetic relationship, we found that the genes in Group 1 are either intronless (4, 15.38%) or intron-poor (22, 84.62%). Therefore, Group 1 is an intron-poor clade. Group 2 contains all three types of genes: intron-rich genes are the most numerous (9, 56.25%), followed by intron-poor genes (6, 37.50%), and intronless genes are the fewest (1, 6.25%). Therefore, Group 2 is an intron-rich clade.
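The three-way classification and the percentages quoted above are straightforward to compute. The following Python sketch shows one way to do it; the function names are ours, and the per-gene intron counts are placeholder values chosen only so that the class totals match those reported for Group 2 (they are not the actual per-gene counts).

```python
from collections import Counter

CLASSES = ("intronless", "intron-poor", "intron-rich")


def classify(n_introns):
    """Classify a gene by intron number: 0, 1 to 3, or more than 3."""
    if n_introns == 0:
        return "intronless"
    return "intron-poor" if n_introns <= 3 else "intron-rich"


def class_percentages(intron_counts):
    """Return {class: (count, percentage)} for a non-empty list of counts."""
    counts = Counter(classify(n) for n in intron_counts)
    total = len(intron_counts)
    return {
        name: (counts[name], round(100 * counts[name] / total, 2))
        for name in CLASSES
    }


# Placeholder intron counts for 16 genes: 1 intronless, 6 intron-poor,
# 9 intron-rich, matching the class totals reported for Group 2.
group2 = [0] + [1, 2, 3, 1, 2, 3] + [4] * 9
result = class_percentages(group2)
print(result)
```

Applying the same function to 4 intronless and 22 intron-poor genes reproduces the 15.38% and 84.62% reported for Group 1.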
Combined with the location of the domains, we found that introns are almost always present in the region encoding the pectinesterase domain, whereas introns are absent in the region encoding the PMEI domain. Intriguingly, for the region encoding the pectinesterase domain, the genes of Group 2 contain more introns, while the genes of Group 1 contain fewer introns.
In conclusion, CFVisual displayed the structure of the rice PME genes clearly and provided useful quantitative information, which improved our understanding of the structure and evolution of the rice PME genes.
The motifs and domains along a line representing the amino acid sequence are shown in Fig. 3. We found that motif 10 occurs only in the PMEI domain and is a sequence signature of that domain. Motifs 7, 4, 5, 1, 11, 3, 2, 9, 6, and 12 are contained in the pectinesterase domain. We also found cases of motif repetition and loss. For example, motif 7, located in the pectinesterase domain, is repeated after motif 4; in addition, the motif set is relatively intact in the Group 1 PMEs, whereas motifs are frequently missing in the Group 2 PMEs. Interestingly, motif 8 and motif 10 are present only in the Group 1 PMEs and cannot be found in the Group 2 PMEs. Overall, the rice PME protein sequences are generally conserved but show some clear differences. From a phylogenetic point of view, the distribution of motifs and domains is clearly group-specific. This helps us to better understand the sequence characteristics and evolution of the rice PMEs.

Availability of data and materials
All data generated or analyzed during this study are included in this published article and the Additional files. We used only public data and did not generate any sequence data ourselves.